Corpus selection and online data collection

Malo Jan & Luis Sattelmayer

2025-01-13

Corpus selection

What is a corpus?

  • Every project involving text analysis starts with a corpus
  • A collection of machine-readable texts, along with metadata
  • Corpus selection is a crucial but often overlooked task
  • Corpus selection is sampling: identifying the population of interest
  • Corpus selection is case selection: choosing documents to analyze according to research questions and interests
  • Which corpus would help me answer my research question?
    • E.g.: Twitter data is fun but not useful for everything
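
A corpus in this sense (texts plus metadata) can be represented as a simple table with one row per document. The sketch below is only an illustration: the field names and the two example documents are invented, not taken from any real corpus.

```python
import csv
import io
from dataclasses import asdict, dataclass


@dataclass
class Document:
    """One text and its metadata (all field names are illustrative)."""
    doc_id: str
    text: str
    author: str
    date: str    # ISO 8601 date, e.g. "2024-03-01"
    source: str


# Two invented documents, only to show the structure.
corpus = [
    Document("d1", "We must act on climate change now.", "Speaker A", "2024-03-01", "parliament"),
    Document("d2", "Taxes should be lowered for small firms.", "Speaker B", "2024-03-02", "press release"),
]


def corpus_to_csv(docs):
    """Serialize the corpus as CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["doc_id", "text", "author", "date", "source"])
    writer.writeheader()
    for doc in docs:
        writer.writerow(asdict(doc))
    return buffer.getvalue()


print(corpus_to_csv(corpus))
```

Keeping the metadata next to the text makes it possible to later compare measures across authors, dates or sources.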

Corpus selection and bias

  • Several aspects to consider when selecting a corpus
  • How exhaustive is the corpus?
    • Can I obtain the entirety of a document type/collection?
  • Think about the data-generating process
    • How were these texts produced?
    • How could this affect the corpus?
  • It is important to explore the corpus and read some of the texts
  • Think about the quantity of interest: what could you derive as a measure from those texts?
    • E.g. how much a given topic is discussed in these texts
    • E.g. how much a given sentiment is expressed in these texts
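
As a rough illustration of such a quantity of interest, the sketch below computes the share of tokens in each text that belong to a keyword list. The keyword list, the tokenizer and the two example texts are assumptions made up for this example; a real measure would need validation against a reading of the texts.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def topic_share(text, keywords):
    """Share of tokens that appear in the keyword set (0.0 for empty texts)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


# Hypothetical dictionary for a "climate" topic.
CLIMATE_TERMS = {"climate", "emissions", "warming"}

# Invented example texts.
docs = {
    "speech_1": "Climate policy must cut emissions",
    "speech_2": "The budget debate lasted all night",
}

shares = {doc_id: topic_share(text, CLIMATE_TERMS) for doc_id, text in docs.items()}
print(shares)
```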

Four biases in text selection

  • Resource Bias
    • “texts often better reflect populations with more resources to produce, record, and store documents” (Grimmer, Roberts, and Stewart 2022)
    • Marginalized social groups, present in every society, are also less visible in text
    • E.g. Twitter data is not representative of the general population; it can only speak for behavior on Twitter
  • Incentive Bias
  • Medium Bias
    • E.g. Twitter only allowed 140 characters per tweet until 2017
    • E.g. meeting transcripts and speeches do not necessarily reflect the tone of the actual communication
  • Retrieval Bias
    • Errors or skewed queries in text mining, opaque APIs

Acquiring a corpus

Transforming videos, audios and images into text

Transcribing audio and video files

  • A lot of data is in audio/video format such as interviews, radio shows, speeches, podcasts, etc.
  • This data can be analyzed with text analysis tools through transcription
  • Potential pipeline: download videos from YouTube and transcribe the audio with Whisper
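
One possible version of this pipeline is sketched below. It assumes the third-party packages yt-dlp (for the download step, a tool the slides do not name) and openai-whisper are installed, along with ffmpeg; the output template and the "base" model size are arbitrary choices. The last function only reshapes Whisper's output into rows with timestamps.

```python
def download_audio(url, out_template="audio/%(id)s.%(ext)s"):
    """Download the best available audio stream and return the file path."""
    import yt_dlp  # third-party: pip install yt-dlp

    options = {"format": "bestaudio/best", "outtmpl": out_template}
    with yt_dlp.YoutubeDL(options) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)


def transcribe(audio_path, model_name="base"):
    """Transcribe one audio file; returns a dict with 'text' and 'segments'."""
    import whisper  # third-party: pip install openai-whisper (requires ffmpeg)

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)


def segments_to_rows(result, source):
    """Turn a Whisper result into one row per segment, keeping timestamps."""
    return [
        {
            "source": source,
            "start": segment["start"],
            "end": segment["end"],
            "text": segment["text"].strip(),
        }
        for segment in result["segments"]
    ]
```

A typical call chain would be `segments_to_rows(transcribe(download_audio(url)), source=url)`, where `url` is the address of a video you are allowed to download.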

Corpus creation with digitized texts

  • Important avenues for research in the digital humanities
  • Digitization of old texts, newspapers, books, archives
  • Makes it possible to create large original corpora
  • Requires OCR (Optical Character Recognition) tools
    • Transforms images of texts into machine-readable text
    • Tools such as Tesseract, TrOCR
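
A minimal sketch of an OCR step with Tesseract, assuming the third-party packages pytesseract and Pillow and the tesseract binary are installed. The folder layout (one PNG scan per page) and the function names are hypothetical.

```python
from pathlib import Path


def ocr_image(path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    import pytesseract     # third-party: pip install pytesseract
    from PIL import Image  # third-party: pip install pillow

    return pytesseract.image_to_string(Image.open(path), lang=lang)


def ocr_folder(folder, ocr=ocr_image, pattern="*.png"):
    """Apply `ocr` to every matching scan, in file-name order."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        rows.append({"page": path.name, "text": ocr(path).strip()})
    return rows
```

The `ocr` argument makes it easy to swap in another engine. OCR output on old documents usually needs manual checks on a sample of pages.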

APIs

Application programming interface (API)

  • One important way to access data online
  • Accessing data typically involves:
    • Authentication with an API key
    • Querying the API with specific requests
    • Receiving data in a structured format (JSON, XML)
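
These three steps can be sketched in Python using only the standard library. The endpoint, the query parameters, the bearer-token scheme and the shape of the JSON response are all assumptions made for illustration; each real API documents its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.example.org/v1/speeches"  # hypothetical endpoint


def build_request(api_key, **params):
    """Steps 1 and 2: attach the API key and encode the query parameters."""
    url = f"{BASE_URL}?{urlencode(params)}"
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed authentication scheme
        "Accept": "application/json",
    }
    return Request(url, headers=headers)


def parse_response(body):
    """Step 3: turn the JSON text into (id, text) pairs (assumed structure)."""
    payload = json.loads(body)
    return [(item["id"], item["text"]) for item in payload["results"]]


def fetch(api_key, **params):
    """Send the request over the network and parse the answer."""
    with urlopen(build_request(api_key, **params), timeout=30) as response:
        return parse_response(response.read().decode("utf-8"))


# Offline demonstration with an invented response body.
request = build_request("MY_KEY", query="climate", page=1)
sample_body = '{"results": [{"id": 1, "text": "We must act."}]}'
records = parse_response(sample_body)
print(request.full_url)
print(records)
```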

Advantages of APIs

Limits of APIs

  • Access only to the data that the API provider wants to share
  • Often limited in the number of requests you can make
  • Some APIs are not free, e.g. the Twitter API now
  • APIcalypse/Post-API Age: APIs can be closed or modified at any time
    • Facebook and Twitter previously offered extensive API access to researchers but have since closed it (see minet for some workarounds)

Learning how to use an API

  • This course will not cover how to use APIs
  • But there are many resources online
  • Request data from an API with httr2 (R)
  • Parse JSON in R with jsonlite, and XML with xml2

Webscraping

What is web scraping?


Web scraping is a data collection technique on the web that involves extracting data from a web page.

  • We are all manual web scrapers: copying/pasting/downloading data from the web daily.
  • But we can automate all of this: machines are faster and more exhaustive than we are.

Use cases

  • Collecting social media data
  • Collecting press releases from an organization, speeches from actors
  • Extracting data from Wikipedia pages
  • Automatically downloading hundreds of PDFs from a website
  • Collecting metadata from a website

How the web is written

  • Web scraping requires a minimal understanding of how the web is written: what is a web page?
    • Code interpreted by a browser (e.g. Chrome, Firefox)
  • The code of a web page can be written in:
    • HTML (HyperText Markup Language): structure and content
    • CSS: style (e.g. font, color)
    • JavaScript: functionalities, dynamic content, search, drop-down menus, etc.
  • Example: a press release from the UN Secretary-General

HTML

  • HTML is a markup language
  • The content of the web is written in tags, which can have attributes
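
To make tags and attributes concrete, the sketch below walks through a small invented HTML snippet with Python's standard `html.parser` and prints each opening tag with its attributes, indented by nesting depth. It assumes every tag in the snippet is explicitly closed.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "release")]
        attr_text = " ".join(f'{name}="{value}"' for name, value in attrs)
        label = f"{tag} [{attr_text}]" if attr_text else tag
        self.lines.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input in which every tag is closed.
        self.depth -= 1


# Invented page, not the source of a real website.
page = (
    '<html><body><div class="release">'
    '<h1>Statement</h1><p id="lead">Text</p>'
    "</div></body></html>"
)

lister = TagLister()
lister.feed(page)
print("\n".join(lister.lines))
```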

The tree-like structure of an HTML document

HTML

  • Most common tags in HTML
    • div
    • p
    • h1, h2, h3
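  • Web scraping consists of extracting content from the HTML source code to get specific information
  • CTRL + U: displays the source code of a page in the browser
  • SelectorGadget: a browser tool for finding the selector of an element by clicking on it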
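
Extracting content from HTML source code can be sketched with the standard library alone. The snippet below collects the text of every `h1` and `p` element from an invented page; the HTML is made up for the example, and dedicated scraping libraries offer more convenient selectors than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found inside the tags listed in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # wanted tag we are currently inside, if any
        self.results = []     # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag
            self.results.append((tag, ""))

    def handle_data(self, data):
        # Text inside nested tags such as <b> is kept as well.
        if self.current is not None:
            tag, text = self.results[-1]
            self.results[-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


# Invented source code of a press release page.
html_source = """
<div class="press-release">
  <h1>Statement by the Secretary-General</h1>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</div>
"""

extractor = TextExtractor(["h1", "p"])
extractor.feed(html_source)
for tag, text in extractor.results:
    print(tag, "|", text)
```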

Ethics on data access

  • Web scraping is about collecting data not intended for you
    • May be illegal, although exceptions exist for research purposes
    • Risks: accessing private data, overloading servers
    • Websites can block you/track you
  • Good practices:
    • API first
    • Check permissions (robots.txt)
    • Slow down scraping
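
Checking permissions and slowing down can be sketched with Python's standard `urllib.robotparser`. The robots.txt content and the URLs below are invented; in practice the file is read from the website itself, and the one-second fallback delay is an arbitrary choice.

```python
import time
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_urls(parser, urls, user_agent="*", sleep=time.sleep):
    """Yield only the allowed URLs, pausing between consecutive ones."""
    delay = parser.crawl_delay(user_agent) or 1  # fall back to 1 second
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip pages the site asks crawlers not to visit
        if not first:
            sleep(delay)
        first = False
        yield url


parser = make_parser(ROBOTS_TXT)
print(parser.can_fetch("*", "https://example.org/news/1.html"))
print(parser.can_fetch("*", "https://example.org/private/x.html"))
```

Each URL yielded by `polite_urls` would then be downloaded by the scraper, so that requests are spaced out by the delay the site asks for.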

Ethics on data use

  • Web scraping can be used to collect personal and sensitive data
  • E.g.: Old Twitter API: possibility to download all tweets of a user in 2 lines of code
  • Data protection laws (GDPR) apply to web scraping
  • Think carefully about how the data will be used

Complex and dynamic pages

  • Web pages are sometimes complex and dynamic (JavaScript)
  • More complex scraping strategies are needed
  • Selenium: simulates clicks in a browser from a script
  • Web pages are also changed and updated: your program can work one day and fail the next
  • Good practice: save the HTML pages on your computer
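
Keeping a local copy of each downloaded page can be sketched as follows. The folder name and the file-naming scheme (a hash of the URL plus a UTC timestamp) are assumptions chosen for the example, not a standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(url, html, folder="raw_html"):
    """Write the HTML of one page to disk and return the file path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A short hash of the URL gives a file name that is safe on any system.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # The timestamp records when this version of the page was collected.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"{name}_{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Parsing can then be run and re-run on the saved files, even if the website changes or disappears.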

References

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
King, Gary, Jennifer Pan, and Margaret E. Roberts. 2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review 107 (2): 326–43.